Computer vision - Lab 5

Agenda

Helpers

Datasets

In this script we will be using:

ARC is a dataset of tasks. A task consists of demonstration input-output pairs and a query: given a new input, what is the correct output? Together they resemble popular intelligence tests, in which you must infer abstract concepts and transformations from the demonstrations and apply them to the query.

The following code loads data into a Task class object. This class contains the following fields:

Note: raster pixels contain numeric values in the range 0-9, where each value represents a class. The classes are unordered, i.e. there is no `<` or `>` relation between them.
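ARC tasks are distributed as JSON files with `train` and `test` lists of input/output grids. A minimal sketch of loading one into a `Task` class might look as follows (the exact field set of the lab's own class may differ):

```python
import json
from dataclasses import dataclass

@dataclass
class Task:
    # Demonstration pairs: list of {"input": grid, "output": grid}.
    train: list
    # Query pairs with the same structure.
    test: list

def load_task(path):
    """Load a single ARC task from a JSON file."""
    with open(path) as f:
        data = json.load(f)
    return Task(train=data["train"], test=data["test"])
```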

Pixel description (Classic)

An intensity-domain 2D image is a 2D projection of a 3D scene, represented as a 2D matrix of pixels. There are many equivalent notations, such as RGB, CMYK, HSV, etc., in which each pixel is written in a different way (using different representations).

Accordingly, each representation of a 3D scene is a description of the light incident on the camera sensor, so pixels should be treated as descriptors of the light arriving from a given scene element.

Since each pixel describes a certain element of the scene (from which the incident light is reflected), it is reasonable to assume that neighboring pixels in the intensity domain are correlated with each other and causally related. This interpretation has become the basis for countless pixel-description methods, based on the pixel's value and its neighborhood, whose purpose is to analyze the content of the 3D scene.

We will call any operation that describes a given pixel or its relationship to its neighborhood a descriptor. The term descriptor (of a pixel) is also used for the value that the descriptor (operation) returns. In other words, a descriptor is either a set of features describing a given object (the whole image, a single pixel) or the function that produces these features.

Building descriptors of simple features

By loading an image into memory as a 2D raster, we obtain the first and one of the simplest descriptors: each pixel is described with 3 values giving the "content" of the blue, green and red channels.

Another descriptor is the grayscale form, in which each pixel has an associated grayscale value: a single value between 0 and 255.

Having several descriptors for a given pixel, we can combine them or derive new descriptors from them. One of the simplest combination operations is to concatenate the values from all descriptors into one longer vector.

Below is presented a descriptor that combines an image in BGR and Grayscale space, as a result describing each pixel with four values (Blue, Green, Red, Grayscale).

From the previously known (convolutional) image filtering methods, we can distinguish low-pass and high-pass filters. Below is an example of creating two descriptors composed of low pass and high pass filter values.

Descriptors can be based on any operation. These can be both complex operations based on machine learning systems, as well as precise algorithms describing specific, previously known concepts.

For any image in ARC, the method below describes the neighborhood of a given pixel.

The method checks whether the adjacent pixel (at the given offset) has the same class as the central pixel.
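A minimal sketch of such a check, assuming the ARC image is a 2D numpy array of classes (function name and the out-of-bounds convention are assumptions):

```python
import numpy as np

def same_as_center(grid, y, x, dy, dx):
    """Return True if the neighbor at offset (dy, dx) has the same class as (y, x)."""
    ny, nx = y + dy, x + dx
    if not (0 <= ny < grid.shape[0] and 0 <= nx < grid.shape[1]):
        return False  # neighbors outside the raster count as different
    return bool(grid[ny, nx] == grid[y, x])
```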

Task 1

For any ARC image, propose a descriptor that counts the adjacent pixels with the same class as the center pixel.

Then find the pixels that have:

Parameters:

Detection and matching of key points

Computing descriptors for every pixel is often inefficient and very time-consuming. There are often areas that contain exactly the same pixels. The solution is to find characteristic points in the image, called key points. Key points are places that are unique to an image. Their presence allows us to describe the image and compare it with other images.

Comparing images based on key points is called key point matching. It involves comparing the key points from both images (point to point) and finding, for each input key point, the closest key point from the second image. In this arrangement, one image is treated as the query (input).

There is also a variant where we find multiple matches for each input key point.

Detection of key points

The key point detection mechanism is closely related to the descriptor itself. For example, for Laplacian edge detection, one can propose finding pixels with extreme values (e.g. -4 and 4).

For an ARC image, let's define a descriptor that counts adjacent pixels different from the center pixel.

Note: The descriptor is based on the solution of Task 1.

A key point is a pixel whose neighbors show great variety in class compared to the key point itself. Let's select only pixels that have at least 6 neighbors of a different class and are not on the border of the image.
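One possible sketch of this detector, assuming a 2D numpy array of classes (function names and the 8-neighborhood are assumptions; the lab's own code may differ):

```python
import numpy as np

def count_different_neighbors(grid, y, x):
    """Count 8-neighbors whose class differs from the center pixel."""
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            if grid[y + dy, x + dx] != grid[y, x]:
                count += 1
    return count

def detect_keypoints(grid, min_different=6):
    """Key points: interior pixels with at least `min_different` differing neighbors."""
    h, w = grid.shape
    return [(y, x)
            for y in range(1, h - 1) for x in range(1, w - 1)
            if count_different_neighbors(grid, y, x) >= min_different]
```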

Matching key points

Next, let's define another descriptor that collects the neighboring pixels into a vector (in a fixed order).
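A sketch of such a descriptor for interior pixels; the clockwise ordering of the 8 neighbors is an arbitrary but fixed choice, which keeps descriptors comparable between pixels:

```python
import numpy as np

def neighborhood_descriptor(grid, y, x):
    """Return the 8 neighbors of (y, x) as a vector, in a fixed clockwise order."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    return np.array([grid[y + dy, x + dx] for dy, dx in offsets])
```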

For all detected key points in the previous section, let's get the corresponding descriptors as calculated above.

Now let's perform analogous transformations for another image from the same ARC task.

Thus, for the 2 images we found their key points and calculated an additional descriptor that should describe a given pixel and its surroundings well.

The next step in the matching mechanism is to define the matching function. The matching function is strictly dependent on how the descriptor works. For data in the same space, a distance / similarity function of the two descriptors can be proposed (e.g. if the data is in a Euclidean space, the Euclidean distance of the two vectors can be computed).

For the above configuration, in which a pixel's descriptor is its adjacent pixels arranged in a fixed order, we can propose a similarity function that computes what percentage of the components of the two descriptors are the same.
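A one-line sketch of this similarity function (the name is illustrative); it returns the fraction of equal components, so 1.0 means identical descriptors:

```python
import numpy as np

def similarity(desc_a, desc_b):
    """Fraction of descriptor components that are exactly equal."""
    return float(np.mean(desc_a == desc_b))
```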

For the above results, the following facts can be noted:

Therefore, matches should only occur if the similarity is greater than a certain threshold.

Let's define additional information for the similarity of the two images:

Finally, to find out if the images are similar or not, you can add a condition on the similarity statistics calculated above. The simplest condition will be the cut off condition:

Task 2 (optional, for additional points toward the final grade)

Write an algorithm that finds all similar images for the input image (query) and the reference image set. The algorithm should specify the following steps:

Note:

Pixel description (OpenCV)

The OpenCV library contains a set of implementations of popular descriptors and additional functionalities for finding matches and for displaying processing results.

Each of the descriptors in the OpenCV library has the following functions:

Harris Corner Detector

The algorithm was presented in the scientific article 'A Combined Corner and Edge Detector' in 1988 by Chris Harris and Mike Stephens.

The algorithm uses the sum of squared intensity changes between adjacent pixels to find flat areas, edges, and corners. The first step is to expand the formula for this sum into a Taylor series, and then compute a value based on the determinant and trace of the gradient matrix, which directly indicates the classification of the area.

  1. The sum of the intensities of adjacent pixels: $$E(u,v) = \sum_{x,y} w(x,y) \, [I(x+u,y+v)-I(x,y)]^2$$
  2. Expansion with Taylor series: $$E(u,v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$$ $$M = \sum_{x,y} w(x,y) \begin{bmatrix}I_x I_x & I_x I_y \\ I_x I_y & I_y I_y \end{bmatrix}$$
  3. Calculation of the final value:
$$R = \det(M) - k\,(\operatorname{trace}(M))^2$$ $$\det(M) = \lambda_1 \lambda_2$$ $$\operatorname{trace}(M) = \lambda_1 + \lambda_2$$

Determining whether a given area is a corner, edge or flat area is made according to the value of $R$: when $|R|$ is small, the area is flat; when $R < 0$ (one eigenvalue much larger than the other), the area is an edge; when $R$ is large, the area is a corner.

harris_region.jpg

SIFT - Scale Invariant Feature Transform

The author of the algorithm is D. Lowe, who first presented it in the paper 'Distinctive Image Features from Scale-Invariant Keypoints' in 2004.

It is an algorithm that determines key points and their descriptors based on gradients in their close surroundings. The next steps of the algorithm are as follows:

  1. The input image is scaled multiple times (e.g. x2, x4, x8, x16), and then Gaussian filters are applied at each scale (e.g. size 2, 4, 6, 8).

  2. A difference operation is performed between neighboring Gaussians at a given scale (directly on the image intensities).

sift_dog.jpg

  3. Each pixel is compared with its surroundings (e.g. 8 neighboring pixels, 3x3) and it is checked whether the pixel is an extreme value.

  4. Similarly, the same pixel is compared with the pixels at the lower and higher scale (scale = Gaussian filter size).

sift_local_extrema.jpg

  5. The extreme values are key point candidates. To select the key points, an additional filtering operation is performed based on the contrast of the pixel's surroundings.

  6. For each key point, its surroundings are selected and gradients are calculated for each surrounding pixel. These gradients are aggregated into histograms over sub-regions of the surroundings.

(figure: key point surroundings divided into mini-regions with gradient histograms)

  7. The histograms are discretized by dividing them, e.g. every 45 degrees, which results in 8 values.

  8. The final step is to combine all values from the surroundings into one final descriptor. For the illustration above, we have 4 mini-regions, each with 8 values, so the descriptor will have a size of 32 (each key point will be described by 32 values). In the standard SIFT configuration, a 4x4 grid of regions is used, giving a 128-value descriptor.

BRIEF - Binary Robust Independent Elementary Features

The algorithm was proposed in the research paper 'BRIEF: Binary Robust Independent Elementary Features' by Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua.

The algorithm is used to calculate descriptors for a given pixel and its surroundings and does not include the stage of key point detection, so it must always be applied to known (previously discovered) key points.

The algorithm is as follows:

  1. For a given neighborhood size (e.g. 3x3), n random pairs of positions are selected.
  2. The neighborhood of the key point is selected.
  3. For each previously selected pair of neighbors, it is checked whether the first value is greater than the second; each comparison yields one bit of the descriptor.

BRIEF in OpenCV

The OpenCV implementation operates on bits, but (due to the specifics of Python) the data is returned as bytes. This means that if we set the descriptor size to 32 bytes when creating it, BRIEF will select 256 pairs for each pixel neighborhood.

The result will be returned in bytes (which makes no difference, because the descriptors are compared with Hamming distance).
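The byte/bit relationship can be illustrated with two synthetic 32-byte descriptors (these are random placeholders, not actual BRIEF output): 32 bytes unpack to 256 binary tests, and the Hamming distance simply counts differing bits across the bytes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical 32-byte descriptors (256 binary tests each).
d1 = rng.integers(0, 256, 32, dtype=np.uint8)
d2 = d1.copy()
d2[0] ^= 0b00000111  # flip three of the 256 bits

def hamming(a, b):
    """Hamming distance: number of differing bits across all bytes."""
    return int(np.unpackbits(a ^ b).sum())
```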

FAST - Features from Accelerated Segment Test

Algorithm for corner detection by means of a simple and quick test of the pixel's surroundings. Proposed in the paper 'Machine learning for high-speed corner detection' by Edward Rosten and Tom Drummond.

This approach is characterized by a very short running time and relatively high effectiveness.

The algorithm performs the following operations:

  1. The intensity of a pixel is taken ($I(x, y)$) and a cut-off threshold ($\lambda_{x, y}$) is selected based on that intensity.

  2. Then a radius is chosen, and the pixels on the circle with this radius centered at $(x, y)$ are selected.

(figure: circle of pixels examined around the candidate corner)

  3. If the intensities of n contiguous pixels along the circle satisfy one of the following properties, we say that this area contains a corner: $$I(x', y') > I(x, y) + \lambda_{x, y}$$ $$I(x', y') < I(x, y) - \lambda_{x, y}$$

ORB - Oriented FAST and Rotated BRIEF

An algorithm that combines the approach proposed by FAST and BRIEF. Presented in ORB: An efficient alternative to SIFT or SURF by Ethan Rublee, Vincent Rabaud, Kurt Konolige and Gary R. Bradski.

Matching descriptors

The matching of key points is done on the basis of their descriptors. OpenCV includes matcher objects that find both the best match and the k-best matches.

In order to obtain correct results, the appropriate descriptor similarity function should be used. For descriptors that operate in the Euclidean space, this may be the normal Euclidean distance. For descriptors such as BRISK, which produce binary descriptors, the Hamming distance would be a better proposition.

Let's load two sample images that show the same object from different angles:

Then, for their representation in the Grayscale color space, let's find the key points and their descriptors using the algorithms below.

Additionally, for the FAST and STAR key points, let's count the BRIEF descriptors.

For the above generated data, we will check the best fit as well as the k-nearest matches, based on Euclidean and Hamming distance.

Closest match

K-nearest matches

Task 3

Implement an algorithm that finds the images most similar to your query. The algorithm should work with any key point detector (a descriptor with a key point detection function) and any descriptor (an object that returns a descriptor for a given key point).

Use the Cats and Dogs (Microsoft) dataset available on the [site](https://www.microsoft.com/en-us/download/details.aspx?id=54765) (or use the download instructions in the code below).

Sub-items to be performed: